Class 14: Interactive Data visualizations¶
Plan for today:
- Quick review of seaborn
- Discuss interactive graphics using plotly
- If there is time: Discuss creating maps
import YData
# YData.download.download_class_code(14) # get class code
# YData.download.download_class_code(14, True) # get the code with the answers
If you are using Colab, install the YData package and mount your Google Drive by uncommenting and running the code below.
# !pip install https://github.com/emeyers/YData_package/tarball/master
# from google.colab import drive
# drive.mount('/content/drive')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Review of seaborn!¶
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
I.e., it is built on top of matplotlib but produces better-looking plots that are easier to create.
Let's start by examining different themes, which can produce better-looking plots. We can do this using the sns.set_theme() function.
# Import seaborn
import seaborn as sns
# Apply the default theme
sns.set_theme() # default style is 'darkgrid'
#sns.set_theme(style='whitegrid')
# Side note: Matplotlib also has themes
# plt.style.available
# plt.style.use('fivethirtyeight')
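As a minimal sketch of what the theme call changes (assuming seaborn 0.11 or later, where sns.set_theme() is available), the snippet below applies the 'whitegrid' style and reads back two of the resulting settings with sns.axes_style(). The choice of 'whitegrid' is only for illustration.

```python
import seaborn as sns

# Apply a theme; style can be 'darkgrid', 'whitegrid', 'dark', 'white', or 'ticks'
sns.set_theme(style="whitegrid")

# With no arguments, sns.axes_style() returns the style settings currently in effect
current_style = sns.axes_style()
print(current_style["axes.grid"])       # whether grid lines are drawn
print(current_style["axes.facecolor"])  # background color of the plotting area
```

Calling sns.set_theme() again with no arguments restores the default 'darkgrid' style.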
Penguins!¶
Let's get a little more practice with seaborn by continuing to explore the penguins data set.
# Let's look at some penguins
penguins = sns.load_dataset("penguins")
penguins.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Plotting a single quantitative variable using sns.displot()¶
We can plot a single quantitative variable using the sns.displot() function.
Properties we can set include:
- x: The name of the data column you want to plot
- hue: The name of the column that determines the color of each group
- kind: The type of plot
Different options for kind are: "hist", "kde", "ecdf"
Warm-up exercise 1¶
Please use sns.displot() to create a visualization of flipper length, where each species is in a different color (i.e., different hue). Also, experiment with the "kind" of visualization and choose the kind you think creates the best visualization.
# plot the flipper length
sns.displot(data = penguins,
            x="flipper_length_mm",
            hue="species",
            kind="hist"); # Experiment with "hist", "kde" and "ecdf"
Pairs plots¶
One of the most useful visualizations for exploring the relationships between several quantitative variables is a "pairs plot", which draws a scatter plot for every pair of quantitative variables in the data. We can create one in seaborn using the sns.pairplot(data) function!
Warm-up exercise 2¶
Use the pairplot() function to visualize the relationships between all columns in the penguins DataFrame. Also, make each species have a different color.
# Create pair plots for the different variables in the penguins data set
sns.pairplot(penguins, hue = "species");
Interactive data visualizations with plotly¶
Let's now look at interactive visualizations using the plotly express package.
Interactive visualizations can't be used in static reports (such as the PDF for your class project), but they are useful for exploring data to understand key trends, and they can be embedded in webpages.
Let's start with our favorite data set to visualize, the gapminder data! The gapminder data comes with the plotly package and can be loaded using the code below.
import plotly.express as px
# Newly added
import plotly
plotly.offline.init_notebook_mode()
gapminder = px.data.gapminder() # the plotly package comes with the gapminder data
print(type(gapminder))
gapminder.head(3)
<class 'pandas.core.frame.DataFrame'>
| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
Let's now get the gapminder data from 2007. As you know, we can do this using Boolean masking. We can also do this using the .query() method!
# Get the gapminder data from only 2007
gapminder_2007 = gapminder[gapminder['year'] == 2007]
gapminder_2007_alt = gapminder.query("year==2007")
gapminder_2007.equals(gapminder_2007_alt)
True
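One convenience of .query() is that the expression string can refer to Python variables by prefixing them with @. The following is a minimal sketch on a tiny made-up DataFrame; the helper name rows_for_year is hypothetical.

```python
import pandas as pd

# Hypothetical toy data in the same long format as gapminder
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [60.0, 62.0, 70.0, 71.5],
})

def rows_for_year(df, target_year):
    """Return the rows of df whose year equals target_year."""
    # '@' tells .query() to look up a Python variable rather than a column
    return df.query("year == @target_year")

masked = toy[toy["year"] == 2007]    # Boolean masking
queried = rows_for_year(toy, 2007)   # the same filter via .query()
print(masked.equals(queried))
```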
Line plots¶
Let's create a line plot showing life expectancy as a function of the year using the px.line() function. In particular, let's set the following properties of the plot:
- x: Year
- y: Life expectancy
- color: The continent
- line_group: The country
- hover_name: The country
- line_shape: spline
- render_mode: svg to use svg graphics
What do you think of this plot?
# Create an interactive line plot
fig = px.line(gapminder, x="year", y="lifeExp",
              color="continent",
              line_group="country",
              hover_name="country",
              line_shape="spline",
              render_mode="svg")
fig.show()
Scatter plots¶
Let's now recreate our scatter plot of country life expectancy as a function of GDP per capita from the gapminder_2007 data, this time using plotly. In particular, we can use the px.scatter(data_frame = , x = , y = , ...) function, which works similarly to seaborn's sns.relplot() function.
Let's try out the px.scatter(data_frame = , x = , y = , ...) function using the following mappings:
- x: GDP per capita
- y: Life expectancy
- size: The country population
- color: Continent
We can also set the following properties:
- hover_name: The name of the country
- log_x: Set it to True to put the x-axis on a log10 scale
- size_max: Set it to 60 to make the scaling for the population display better
Finally, if we want to have separate facets for columns we can use facet_col.
# Create a scatter plot in plotly
fig = px.scatter(data_frame = gapminder_2007,
                 x="gdpPercap",
                 y="lifeExp",
                 size="pop",
                 color="continent",
                 hover_name="country",
                 log_x=True,
                 size_max=60)
# Add axis labels
fig.update_layout(xaxis_title="GDP per capita ($)",
                  yaxis_title="Life Expectancy")
fig.show()
Animations¶
We can also add animations to our plots using the following arguments:
- animation_frame: Defines which variable to animate over; i.e., each frame in the animation will be one value of this variable.
- animation_group: Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching animation_group values will be treated as if they describe the same object in each frame. This allows the animation to smoothly interpolate between frames.
We can also set the x and y ranges of our plots to match the ranges of data over the full animation sequence.
- range_x: The range that the x-values should take
- range_y: The range that the y-values should take
# Create an animated scatter plot
fig = px.scatter(gapminder,
                 x="gdpPercap",
                 y="lifeExp",
                 animation_frame="year",
                 animation_group="country",
                 size="pop",
                 color="continent",
                 hover_name="country",
                 facet_col="continent",
                 log_x=True,
                 size_max=45,
                 range_x=[100,100000],
                 range_y=[25,90])
fig.show()
Additional visualizations¶
There are a number of other visualizations we can create using plotly. Let's briefly explore sunburst plots, treemaps, and heatmaps.
Please see the plotly express documentation to learn more about other plots you can create: https://plotly.com/python/plotly-express/
Sunburst plots¶
Sunburst is a generalization of a pie chart for data that has a hierarchical structure; i.e., it can plot categorical data that has a hierarchical structure.
Let's create a sunburst plot showing how much of the world's population is in each continent at the inner level, and then each country within each continent at the outer level. In particular, let's set the following properties:
- path: Should be a list with continent at the inner level and country at the outer level.
- values: Should specify that the angle of each segment is given by the country's population
- color: Set to the countries' life expectancies
What do you think of this plot?
# Create a sunburst plot
fig = px.sunburst(gapminder_2007,
                  path=['continent', 'country'],
                  values='pop',
                  color='lifeExp')
fig.update_layout(width = 500, height = 500)
Treemap¶
Treemaps allow one to view hierarchical relationships by creating a sequence of nested rectangles. We can use plotly's px.treemap() function to create interactive treemaps.
Let's create an interactive treemap showing the population of each country separately for each continent, as well as color each country based on the average life expectancy. In particular, let's set the following properties:
- path: Should be a list with continent at the highest level and country nested within continent. We can also set the first element of the list to be px.Constant('world') so that at the highest level we get the label "world".
- values: Should specify that the size of each rectangle is equal to a country's population
- color: Set to the countries' life expectancies
What do you think of this plot?
# Create a treemap
fig = px.treemap(gapminder_2007,
                 path=[px.Constant('world'), 'continent', 'country'],
                 values='pop',
                 color='lifeExp')
                 #color='gdpPercap')
fig.show()
Pivot tables and heatmaps¶
Heatmaps allow us to view data that is a function of two variables.
In order to create a heatmap, we first need to transform our data into a DataFrame that has appropriate rows and columns. One way we can do this is to use the pandas .pivot_table(index = , columns = , values = , aggfunc = ) method, where the arguments to this method are:
- index: The variable we want in the rows of our DataFrame
- columns: The variable we want in the columns of our DataFrame
- values: The values we want to be in the DataFrame
- aggfunc: The function we will use to aggregate our data
Let's apply the .pivot_table() method to our gapminder data to create a DataFrame called gapminder_continent_wide where:
- The rows are the different continents
- The columns are the year
- The values in the DataFrame are the average life expectancy (For each continent in each year)
# Generate a pivot table from the gapminder data
gapminder_continent_wide = gapminder.pivot_table(index = 'continent',
                                                 columns = 'year',
                                                 values = 'lifeExp',
                                                 aggfunc = 'mean')
gapminder_continent_wide.head()
| continent | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 39.135500 | 41.266346 | 43.319442 | 45.334538 | 47.450942 | 49.580423 | 51.592865 | 53.344788 | 53.629577 | 53.598269 | 53.325231 | 54.806038 |
| Americas | 53.279840 | 55.960280 | 58.398760 | 60.410920 | 62.394920 | 64.391560 | 66.228840 | 68.090720 | 69.568360 | 71.150480 | 72.422040 | 73.608120 |
| Asia | 46.314394 | 49.318544 | 51.563223 | 54.663640 | 57.319269 | 59.610556 | 62.617939 | 64.851182 | 66.537212 | 68.020515 | 69.233879 | 70.728485 |
| Europe | 64.408500 | 66.703067 | 68.539233 | 69.737600 | 70.775033 | 71.937767 | 72.806400 | 73.642167 | 74.440100 | 75.505167 | 76.700600 | 77.648600 |
| Oceania | 69.255000 | 70.295000 | 71.085000 | 71.310000 | 71.910000 | 72.855000 | 74.290000 | 75.320000 | 76.945000 | 78.190000 | 79.740000 | 80.719500 |
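It can help to think of .pivot_table() as a group-by followed by a reshape. The sketch below, on a tiny made-up DataFrame, builds the same wide table both ways: once with .pivot_table() and once with .groupby() plus .unstack().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in long format
toy = pd.DataFrame({
    "continent": ["X", "X", "X", "X", "Y", "Y"],
    "year": [2002, 2002, 2007, 2007, 2002, 2007],
    "lifeExp": [60.0, 62.0, 64.0, 66.0, 70.0, 72.0],
})

# Wide table via pivot_table: rows are continents, columns are years
wide_pivot = toy.pivot_table(index="continent",
                             columns="year",
                             values="lifeExp",
                             aggfunc="mean")

# The same table via groupby, then moving 'year' from the rows to the columns
wide_groupby = (toy
                .groupby(["continent", "year"])["lifeExp"]
                .mean()
                .unstack("year"))

print(wide_pivot)
```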
Now that we have the appropriate DataFrame, let's use the plotly imshow() function to visualize it!
# use plotly imshow() to visualize the pivot table
fig = px.imshow(gapminder_continent_wide)
fig.update_layout(xaxis_title = "Year", yaxis_title = "")
# We can create heatmaps in seaborn as well
g = sns.heatmap(gapminder_continent_wide,
                annot=True,
                fmt=".0f");
g.set_xlabel("");
g.set_ylabel("");